
[WIP] Updates to fluence to restore functionality and update builds #61

Closed
wants to merge 25 commits

Conversation


@vsoch vsoch commented Feb 17, 2024

This PR will likely be open for a few days while I carefully work through needed changes. I want to try to commit in a meaningful way (my previous commits have been a bit messy, I think) and have the diffs here to inspect. In particular, I plan to do the following:

Docker Builds

  • Update fluence container build to use flux-framework/fluxion-go (works!)
    • Simplify corresponding Dockerfile to use flux-sched base
  • Update the build to include a new container, ghcr.io/flux-framework/fluence-controller
    • Also update the automated build to deploy this container
  • Rename the helm install to something easy to remember, "fluence"

Miscellaneous

  • Remove unused manifests/fluence
  • Add reproducing examples for pod group clogging, and kind config
  • Improve developer documentation in README
  • Design documentation (diagram) in docs/README.md

Resources

  • Fluence should account for itself in resources (moving to a future issue)
  • Control plane nodes should not be scheduled to
  • Resource derivations should parse all containers (not just the first)
  • The job name likely needs to be the group name when a pod group is found (not the pod name)

Services

  • Add a grpc service that exposes the scheduler for kubectl fluence

PodGroup

  • Add the controller from upstream kubernetes-sigs to build here (anticipating working on PodGroup here)
  • Millisecond timestamps (this will need to be done again)
  • Refactor back to use PodGroup
    • PodGroup creation should happen in a reconciler we control (see the sketch after this list)
    • PodGroup deletion (update: owner references didn't take, so we delete the group when the number of finished/failed pods >= the group's minimum members)
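
A minimal sketch of the creation item above, assuming the upstream scheduler-plugins PodGroup types and hypothetical label keys (`fluence.group-name` / `fluence.group-size`), not the actual fluence code:

```go
// Sketch only: create the PodGroup for a pod's group labels if it does not exist yet.
// The label keys are hypothetical; the PodGroup types are assumed from upstream scheduler-plugins.
package controller

import (
	"context"
	"strconv"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	schedv1alpha1 "sigs.k8s.io/scheduler-plugins/apis/scheduling/v1alpha1"
)

// ensurePodGroup creates a PodGroup matching the pod's group labels if one is missing.
func ensurePodGroup(ctx context.Context, c client.Client, pod *corev1.Pod) error {
	name := pod.Labels["fluence.group-name"]             // hypothetical label key
	size, _ := strconv.Atoi(pod.Labels["fluence.group-size"]) // sketch: ignore parse errors

	var pg schedv1alpha1.PodGroup
	err := c.Get(ctx, types.NamespacedName{Name: name, Namespace: pod.Namespace}, &pg)
	if err == nil || !apierrors.IsNotFound(err) {
		return err // already exists, or a real error
	}

	pg = schedv1alpha1.PodGroup{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: pod.Namespace},
		Spec:       schedv1alpha1.PodGroupSpec{MinMember: int32(size)},
	}
	return c.Create(ctx, &pg)
}
```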

Controller

  • Build of fluence-controller here
  • Create mutating admission webhook for PodGroup
    • Mutating admission webhook should watch for pods
    • Mutating admission webhook should watch for jobs (and add the labels to the PodTemplate; see the sketch after this list)
    • Webhook should handle other pod abstractions
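
A minimal sketch of that job-watching idea: default the group labels onto the Job's pod template so individual pods do not need to be patched one by one. The label keys and defaulting rules here are hypothetical:

```go
// Sketch only: the label keys and defaulting rules are hypothetical, not the fluence implementation.
package webhook

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
)

// mutateJob adds group name/size labels to the Job's pod template in place.
func mutateJob(job *batchv1.Job) {
	if job.Spec.Template.Labels == nil {
		job.Spec.Template.Labels = map[string]string{}
	}
	labels := job.Spec.Template.Labels
	if labels["fluence.group-name"] == "" {
		labels["fluence.group-name"] = job.Name // default the group name to the job name
	}
	if labels["fluence.group-size"] == "" {
		size := int32(1)
		if job.Spec.Parallelism != nil {
			size = *job.Spec.Parallelism // group size follows parallelism
		}
		labels["fluence.group-size"] = fmt.Sprintf("%d", size)
	}
}
```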

Update: I removed the testing cases for deployments / statefulsets etc., because their cleanup strategies need further thinking and discussion; these are unlikely cases, so they are not a priority.

This is a WIP; I will change the title if/when it is ready for review, and add points as I work on / find them. The criteria (for me) for this to be ready are resolving the clogging use case and consistently working for the larger test case on GCP.

vsoch added 10 commits January 7, 2024 21:37
Also tweak some docstrings for fluence.

Signed-off-by: vsoch <[email protected]>
Problem: the local upstream might get out of date
Solution: provide an easy make update to pull from it.
Signed-off-by: vsoch <[email protected]>
Signed-off-by: vsoch <[email protected]>

There are too many edge cases / too much complexity and behavior
that I do not understand to pursue having the pod group information
cached with fluence. For now I am nuking it and testing the initial
design as a sanity check.
Problem: pod groups with second-level precision have interleaving timestamps.
Solution: try to create an internal representation with better
precision. This looks promising in early testing, but I need
to consider the edge cases and how to clean up the groups; otherwise
a pod group might be re-created later while still in the cache.

Signed-off-by: vsoch <[email protected]>
We install with the helm manifests, and the old fluence
manifests might be confusing (they have changed). This commit
will remove the old manifests, and also change some of the fmt.Print
logging to use klog to be easier to parse.

Signed-off-by: vsoch <[email protected]>
This adds prototype support for an extra helm flag that
enables both an extra set of grpc endpoints and the
configs (ingress and service) necessary to expose them.
I next need to figure out how to interact with grpc from
a local client, likely built from the same codebase and grpc
spec. This is super cool!!

Signed-off-by: vsoch <[email protected]>
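
As a rough sketch of the local-client side (the service name and generated client below are hypothetical placeholders, not a real fluence API), something like this should be possible once the endpoint is exposed:

```go
// Sketch only: dial the exposed grpc endpoint from a local client.
// The address is illustrative and the generated client is hypothetical.
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Dial the endpoint exposed by the helm-enabled service/ingress.
	conn, err := grpc.DialContext(ctx, "localhost:4242",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("could not connect: %v", err)
	}
	defer conn.Close()

	// client := pb.NewFluenceServiceClient(conn) // hypothetical generated client
	// resp, err := client.Match(ctx, &pb.MatchRequest{...})
}
```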
Problem: we want to be able to persist PodGroup if upstream removes it
Solution: build our own controller image, also allowing us to
tweak it to enhance fluence. This commit also renames the helm
install to be "fluence" so it is easier for the developer
workflow

Signed-off-by: vsoch <[email protected]>
@vsoch vsoch mentioned this pull request Feb 17, 2024
@vsoch vsoch force-pushed the fluence-controller branch 3 times, most recently from 64d9f29 to 458f73c Compare February 17, 2024 23:32
@vsoch vsoch force-pushed the fluence-controller branch from 458f73c to 1c0e5a3 Compare February 17, 2024 23:39
Problem: we want to try a design where a mutating admission webhook can handle
receiving and creating PodGroup from labels. We are choosing mutating with
the expectation that, at some point, we might be able to change the size
(min/max/desired) either for the PodGroup or some other watcher to jobs.
Note that this is an empty skeleton - the webhook is added and running
but basically doing nothing. I am also fixing a bug that I noticed while
running kind: fluence was assigning work to the control plane. I think
there may have been logic (a commented out worker label) anticipating
a check for a control plane, but it looks like on production clusters
we do not always have access, and it was never finished. Note that this
addition does not guarantee this design will work, but it is just one
step. Since the helm charts are manually generated for the
scheduler-plugin (as far as I can tell), this took me almost 6 hours to
figure out and get working. I am really starting to think there is no
skill behind software engineering beyond absolute patience.

Signed-off-by: vsoch <[email protected]>
Problem: we need every pod object coming into the cluster to be
part of a group.
Solution: This change adds logic to the mutating webhook to
add the labels that indicate the group name and size. We
can eventually add flexibility here. I also realize that
we can easily watch for job objects first, and add the
group size/name to the pod template. This is much more
efficient, since we then do not have to add the labels to
the individual pods that are part of a larger job. With this
approach I was able to create a fluence-scheduled pod, and
then see my labels added! It does not do anything beyond that.
I am also adding a nice script that makes it easy to
build, load, and install fluence freshly, otherwise you
will use all your internet data for the month in like,
two days. Do not do that :P

Signed-off-by: vsoch <[email protected]>
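
A minimal sketch of what the mutating pod webhook could look like in controller-runtime style (the decoder wiring differs a bit across controller-runtime versions, and the label keys and defaults are hypothetical):

```go
// Sketch only: decode the incoming Pod, default the group labels if missing, return a patch.
// Label keys and defaulting rules are hypothetical, not the exact fluence implementation.
package webhook

import (
	"context"
	"encoding/json"
	"net/http"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

type podLabeler struct {
	decoder admission.Decoder
}

func (m *podLabeler) Handle(ctx context.Context, req admission.Request) admission.Response {
	pod := &corev1.Pod{}
	if err := m.decoder.Decode(req, pod); err != nil {
		return admission.Errored(http.StatusBadRequest, err)
	}
	if pod.Labels == nil {
		pod.Labels = map[string]string{}
	}
	// A lone pod defaults to a group of one, named after itself.
	if pod.Labels["fluence.group-name"] == "" {
		pod.Labels["fluence.group-name"] = pod.Name
	}
	if pod.Labels["fluence.group-size"] == "" {
		pod.Labels["fluence.group-size"] = "1"
	}
	marshaled, err := json.Marshal(pod)
	if err != nil {
		return admission.Errored(http.StatusInternalServerError, err)
	}
	return admission.PatchResponseFromRaw(req.Object.Raw, marshaled)
}
```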
@vsoch vsoch force-pushed the fluence-controller branch from 2029a1d to 10d624d Compare February 18, 2024 03:55

vsoch commented Feb 18, 2024

Ah this is really interesting - the PodGroup reconciler was already watching pods, so there was never a need to manually create the PodGroup - given the label, it would have been created for us. The subtle difference between then and now is that we want everything scheduled by fluence to have a group. Given our added webhook that adds the label, we would only hit this case if a Pod came through that was scheduled by the default scheduler (because it wouldn't be using the controller alongside fluence, but the controller would still be watching all Pods in the cluster). But maybe there is a use case to do that for combining across objects with pods, or asking for a different size than the number of pods in some set. Will think more about it tomorrow / later today. 💤

Ah I don't think that's right (was looking at my updated code) - in the original it would hit here and not be found, so it does need something to create it.

Problem: we want the labels (size and name) that are explicitly set
to lead to the creation of the pod group, so the user does not need
to create it. This is done by way of a watcher on pods, which will
trigger after the webhook that ensures that every pod (in a job or a
single pod) has the proper label. Likely we want to do this for other
abstractions that hold pods as well, because it ensures that no matter
how the pods go into pending, we have the correct size and name. The
only case where a pod can come in without the label is when it was not
scheduled by fluence. The user is directed not to do this, but it is
not impossible (e.g., fluence sees itself show up here, actually). So
after this addition we have the full steps to add the labels and create
the pod group, and the next steps are (finally) to integrate this into
fluence (and remove the old abstraction that stores it in memory).

Signed-off-by: vsoch <[email protected]>
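
A minimal sketch of the pod watch wiring, using controller-runtime v0.15+ signatures (the label key is hypothetical and the Reconcile body is elided):

```go
// Sketch only: a pod carrying the (hypothetical) group label enqueues a reconcile
// request for its PodGroup, so the group is created without the user doing anything.
package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	schedv1alpha1 "sigs.k8s.io/scheduler-plugins/apis/scheduling/v1alpha1"
)

type PodGroupReconciler struct {
	client.Client
}

func (r *PodGroupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Here we would read the group labels and create the PodGroup if it is missing.
	return ctrl.Result{}, nil
}

func (r *PodGroupReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&schedv1alpha1.PodGroup{}).
		// Watch pods: map a labeled pod to a reconcile request for its PodGroup.
		Watches(&corev1.Pod{}, handler.EnqueueRequestsFromMapFunc(
			func(ctx context.Context, obj client.Object) []reconcile.Request {
				name, ok := obj.GetLabels()["fluence.group-name"]
				if !ok {
					return nil // not scheduled by fluence, nothing to create
				}
				return []reconcile.Request{{NamespacedName: client.ObjectKey{
					Namespace: obj.GetNamespace(), Name: name,
				}}}
			})).
		Complete(r)
}
```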

vsoch commented Feb 18, 2024

See last commit message for recent update: 000baac

The PR now supports the full pipeline of a webhook adding labels, a watcher on pod to get them and trigger a PodGroup reconcile, and creation of the PodGroup. In a diagram:

apply -> 
  podgroup webhook -> 
    add labels (size/name) -> 
      watch on pod in PodGroup reconciler --> 
        if Pending -> PodGroup

Here is an example PodGroup created by a Job that has completions/parallelism set to 3 - we see that the size reflects that.

[screenshot of the created PodGroup, showing size 3]

And then the PodGroup is created and enters the reconcile process akin to any other CRD. Note that I also decided to keep / use the original label for the name - I didn't see a strong reason to change it, and was worried about potential bugs that having two might cause in parts of the code we don't reproduce / aren't familiar with.

At this point we should have enough to (finally) integrate this back into fluence. The changes now are that the PodGroup creation (size detection) is automated, the timestamp is updated to use microseconds, and we have better ownership / control over how the group and associated metadata are handled (good for future ideas, I think).

Still no guarantee any of this will work, but I can say that we haven't failed yet.

vsoch and others added 8 commits February 18, 2024 18:16
Problem: fluence should only be storing the state of the jobid and
the presence of a group name in a map to indicate node assignment.
Solution: update the code here. Note that this is not working
yet, and I am pushing / opening the PR to not lose the work
(and will update accordingly, using this PR to test).

Signed-off-by: vsoch <[email protected]>
[WIP] fluence: refactor to use new PodGroup
Problem: Since the PodGroup controller creates the PodGroup, it
should delete it as well.
Solution: Ideally I wanted to attach an owner reference, meaning
that the top level job (that also owns the pod) would be the owner
of the PodGroup. But that does not seem to take - either because
the controller is the owner or the field is read only for k8s.
For the time being, I decided to delete the PodGroup when the
group is determined to be Finished/Failed, which happens when
that number of pods equals or exceeds the MinimumSize. Granted
that MinimumSize == size, I think this should be OK with fluence,
and we might need to consider other approaches if/when the
min size is smaller than the total size (because fluence might
still see a pod in the queue and try to schedule again). I think
what we might do in that case is just update the MinSize for
the group, so if fluence schedules again it will be for the
smaller size. But not sure about that either! TBA. The
important thing now is that the pod group cleans itself up!

Signed-off-by: vsoch <[email protected]>
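
A minimal sketch of that cleanup rule, assuming MinMember == size and the upstream PodGroup status fields:

```go
// Sketch only: once succeeded + failed pods reach the group's MinMember, the group
// is considered done and is deleted. Field names follow the upstream PodGroup API.
package controller

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
	schedv1alpha1 "sigs.k8s.io/scheduler-plugins/apis/scheduling/v1alpha1"
)

// cleanupIfFinished deletes the PodGroup once enough of its pods have finished or failed.
func cleanupIfFinished(ctx context.Context, c client.Client, pg *schedv1alpha1.PodGroup) error {
	finished := pg.Status.Succeeded + pg.Status.Failed
	if finished >= pg.Spec.MinMember {
		return c.Delete(ctx, pg)
	}
	return nil
}
```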
Problem: we need to be able to run deployments, stateful/replica sets
and have them handled by fluence.
Solution: allow the webhook to create pod groups for them. In the case
that they are not targeted for fluence (any abstraction) and get into
PreFilter, allow creation of a FauxPodGroup that will simply schedule
one job for the pod. We do this twice - in PreFilter and in the events
for update/delete.

Signed-off-by: vsoch <[email protected]>
feat: add support for other abstractions
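
A minimal sketch of the faux group idea: a pod that reaches PreFilter without group labels gets wrapped in an in-memory, size-one group (names here are illustrative, not the actual fluence types):

```go
// Sketch only: build a single-member group for a pod that was not targeted for fluence,
// so it can still be scheduled as one job. Names are illustrative.
package fluence

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	schedv1alpha1 "sigs.k8s.io/scheduler-plugins/apis/scheduling/v1alpha1"
)

// fauxPodGroup builds an in-memory, size-one PodGroup for a pod with no group labels.
func fauxPodGroup(pod *corev1.Pod) *schedv1alpha1.PodGroup {
	return &schedv1alpha1.PodGroup{
		ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		Spec:       schedv1alpha1.PodGroupSpec{MinMember: 1},
	}
}
```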
Problem: I noticed in testing that the time only had granularity down to the second.
Solution: It appears that when we do a create of the PodGroup from the reconciler watch,
the metadata (beyond name and namespace) does not stick. I am not sure why, but the labels
are still retrievable from the pods (via the mutating webhook) after. So instead, we need
to get the size and creation timestamp at the first hit in reconcile, which (given how that
works) should still somewhat honor the order. I did try adding the timestamp to a label,
but it got hairy really quickly (kept me up about 3 hours longer than I intended!). The
good news is that I now see the microseconds in the Schedule Start Time, so we should be
almost ready to test this on a GCP cluster. I also had lots of time waiting for the containers
to rebuild, so I made a diagram of how it is currently working. I have some concerns about
the internal state of fluxion (my kind cluster stopped working after some hours and I do not
know why), but we can address them later. We mostly need to see if there are jobs that are being
forgotten, etc.

Signed-off-by: vsoch <[email protected]>
bug: the metav1.MicroTime was not being set
Problem: the design description did not correspond with the numbers
Solution: fix them up, also fix some bugs in the controller and
fluence that assume we have pods / pod group (we do not always)

Signed-off-by: vsoch <[email protected]>
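
A minimal sketch of the timestamp fix: set the schedule start time exactly once, at the first reconcile hit, with metav1.MicroTime (this assumes the group's ScheduleStartTime field has been switched to MicroTime, which is the fluence-controller change described above and differs from upstream):

```go
// Sketch only: assumes the schedule start time field is metav1.MicroTime rather than
// the upstream metav1.Time, so groups created within the same second still order correctly.
package controller

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// stampScheduleStart sets the start time only if it has not been set yet.
func stampScheduleStart(start *metav1.MicroTime) {
	if start.IsZero() {
		*start = metav1.NowMicro()
	}
}
```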
vsoch and others added 3 commits February 19, 2024 08:51
I am making small changes as I test on GKE and EKS. My first tests on GKE had
me creating / deleting jobs, and I think the state of fluence (fluxion) got out
of sync with the jobs, meaning that fluxion thought jobs were running that were not
and then was unable to allocate new ones. To adjust for that we can add back in the
cancel response, but this will only work given that fluence has not lost memory
of the job id. We likely need an approach that can either save the jobids to the
state data (that could be reloaded) or a way to inspect jobs explicitly and purge,
OR (better) a way to look up a job not based on the id, but based on the group id
(the command in the jobspec). That way, regardless of a jobid, we could lose all
of our state and still find the old (stale) job to delete. With a fresh state
and larger cluster I am able to run jobs on GKE, but they are enormously slow -
lammps size 2 2 2 is taking over 20 minutes. This is not the fault of fluence -
GKE networking sucks. To keep debugging I likely need to move over to AWS with
EFA, of course that introduces more things to figure out like EFA, etc.

Signed-off-by: vsoch <[email protected]>
@vsoch vsoch closed this Apr 9, 2024